Unknown Rewards in Finite-Horizon Domains

Authors

  • Colin McMillen
  • Manuela M. Veloso
Abstract

“Human computation” is a recent approach that extracts information from large numbers of Web users. reCAPTCHA is a human computation project that improves the process of digitizing books by getting humans to read words that are difficult for OCR algorithms to read (von Ahn et al., 2008). In this paper, we address a strategic control problem inspired by the reCAPTCHA project: given a large set of words to transcribe before a deadline, how can we choose the difficulty level so as to maximize the probability of successfully transcribing a document on time? Our approach is inspired by previous work on timed, zero-sum games, as we face an analogous timed policy decision on the choice of words to present to users. However, our Web-based word-transcribing domain is particularly challenging because the reward of the actions is not known; i.e., it is not known whether the spelling provided by a human is actually correct. We contribute an approach that solves this problem by checking a small fraction of the answers at execution time, obtaining an estimate of the cumulative reward. We present experimental results showing how the number of samples and the time between samples affect the probability of success. We also investigate the choice of aggressive or conservative actions with regard to the bounds produced by sampling. We successfully apply our algorithm to real data gathered by the reCAPTCHA project.
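The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `estimate_cumulative_reward`, its parameters, and the use of a Hoeffding confidence interval are all assumptions made here for illustration. Given the number of answers collected so far and a small checked sample of them, it returns a point estimate of how many answers are correct, together with lower and upper bounds.

```python
import math


def estimate_cumulative_reward(answers_total, sampled_correct, delta=0.05):
    """Estimate how many of `answers_total` answers are correct.

    `sampled_correct` is a list of 0/1 flags for the small checked sample
    (hypothetical input format). The bounds come from Hoeffding's
    inequality and hold with probability at least 1 - delta; this choice
    of bound is an assumption, not taken from the paper.
    """
    n = len(sampled_correct)
    if n == 0:
        raise ValueError("need at least one checked answer")

    # Fraction of checked answers that were correct.
    p_hat = sum(sampled_correct) / n

    # Two-sided Hoeffding half-width: P(|p_hat - p| >= eps) <= delta.
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    estimate = p_hat * answers_total
    lower = max(0.0, p_hat - eps) * answers_total
    upper = min(1.0, p_hat + eps) * answers_total
    return estimate, lower, upper


# Example: 1000 answers collected, 100 checked, 80 of those correct.
estimate, lower, upper = estimate_cumulative_reward(1000, [1] * 80 + [0] * 20)
print(f"estimate={estimate:.0f}, bounds=[{lower:.0f}, {upper:.0f}]")
```

Under these assumptions, checking more answers narrows the interval, which is one way to see why the number of samples matters. A controller could then compare the lower or upper bound against the transcription target when deciding between conservative and aggressive actions.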

Similar resources

Finite-horizon variance penalised Markov decision processes

We consider a finite horizon Markov decision process with only terminal rewards. We describe a finite algorithm for computing a Markov deterministic policy which maximises the variance penalised reward and we outline a vertex elimination algorithm which can reduce the computation involved.

Receding Horizon based Cooperative Vehicle Control with Optimal Task Allocation

By Mohammad Khosravi. The problem of cooperative multi-target interception in an uncertain environment is investigated in this thesis. The targets arrive in the mission space sequentially at a priori unknown time instants and a priori unknown locations, and then move on a priori unknown trajectories. A group of vehicles...

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes

We analyze losses resulting from uncertain transition probabilities in Markov decision processes with bounded nonnegative rewards. We assume that policies are pre-computed using exact dynamic programming with the estimated transition probabilities, but the system evolves according to different, true transition probabilities. Our approach analyzes the growth of errors incurred by stepping backwa...

Forecast Horizons for a Class of Dynamic Games

In theory, a Markov perfect equilibrium of an infinite horizon, non-stationary dynamic game requires from players the ability to forecast an infinite amount of data. In this paper, we prove that early strategic decisions are effectively decoupled from the tail game, in non-stationary dynamic games with discounting and uniformly bounded rewards. This decoupling is formalized by the notion of a “...

The Robot Routing Problem for Collecting Aggregate Stochastic Rewards

We propose a new model for formalizing reward collection problems on graphs with dynamically generated rewards which may appear and disappear based on a stochastic model. The robot routing problem is modeled as a graph whose nodes are stochastic processes generating potential rewards over discrete time. The rewards are generated according to the stochastic process, but at each step, an existing...

LTL receding horizon control for finite deterministic systems

This paper considers receding horizon control of finite deterministic systems, which must satisfy a high-level, rich specification expressed as a linear temporal logic formula. Under the assumption that time-varying rewards are associated with states of the system and these rewards can be observed in real time, the control objective is to maximize the collected reward while satisfying the high lev...

Publication date: 2008